Entropy-Based Authorship Search in Large Document Collections
Abstract
The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries by topic and are therefore not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory: we rank documents by the relative entropy of style markers, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found within large collections. Although effectiveness degrades as collection size increases, even with 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and that it is much more scalable than state-of-the-art approaches in terms of collection size and number of candidate authors.
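The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the marker list `MARKERS`, the Dirichlet smoothing with parameter `mu`, the add-one background model, and the function names are all assumptions made for illustration. The sketch builds a smoothed distribution of style markers for the author and for each document, then ranks documents by ascending relative entropy (Kullback-Leibler divergence) from the author's distribution.

```python
import math
from collections import Counter

# Hypothetical style markers: a tiny function-word list (an assumption,
# not the marker set used in the paper).
MARKERS = ("the", "of", "and", "to", "in", "that", "but", "however")


def marker_counts(text):
    """Count style markers in lower-cased, whitespace-tokenized text."""
    return Counter(w for w in text.lower().split() if w in MARKERS)


def smoothed_dist(counts, background, mu=10.0):
    """Dirichlet-smoothed distribution over MARKERS (assumed smoothing scheme).

    `background` must assign a strictly positive probability to every marker.
    """
    total = sum(counts.values())
    return {m: (counts[m] + mu * background[m]) / (total + mu) for m in MARKERS}


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats over the marker vocabulary."""
    return sum(p[m] * math.log(p[m] / q[m]) for m in MARKERS if p[m] > 0)


def rank_by_author(author_texts, documents, mu=10.0):
    """Return document indices ordered from most to least author-like."""
    # Background model: add-one smoothed marker frequencies over all texts.
    bg_counts = Counter()
    for text in list(author_texts) + list(documents):
        bg_counts.update(marker_counts(text))
    bg_total = sum(bg_counts.values()) + len(MARKERS)
    background = {m: (bg_counts[m] + 1) / bg_total for m in MARKERS}

    # Author model built from the pooled training texts.
    author_counts = Counter()
    for text in author_texts:
        author_counts.update(marker_counts(text))
    p_author = smoothed_dist(author_counts, background, mu)

    # Lower divergence means the document's style is closer to the author's.
    scored = []
    for index, doc in enumerate(documents):
        p_doc = smoothed_dist(marker_counts(doc), background, mu)
        scored.append((kl_divergence(p_author, p_doc), index))
    scored.sort()
    return [index for _, index in scored]
```

Calling `rank_by_author(training_texts, collection)` returns the indices of `collection` with the stylistically closest documents first. A realistic system would use a much richer marker set and an index rather than a linear scan.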
Similar resources
Authoritative Re-Ranking in Fusing Authorship-Based Subcollection Search Results
We examine the use of authorship information to divide IR test collections into subcollections and apply techniques from the field of distributed information retrieval to enhance the baseline search results. We determine the expertise of each author, based on the content of their documents, and use this knowledge to construct rankings of the different author subcollections for each query. We go...
Author Identification on the Large Scale
Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. In recent years, practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and har...
An Analysis of Ministry of Education’s Strategic Plans Based on Favorable Components of English Language Teaching Using Shannon’s Entropy
The present research aims to analyze the content of Ministry of Education’s strategic plans (the Fundamental Reform Document of Education, the Comprehensive National Scientific Plan and the National Curriculum Document) based on Shannon's entropy regarding the favorable components of teaching English. The contents of the Fundamental Reform Document of Education, the Comprehensive National Scien...
The Reliability of Metrics Based on Graded Relevance
Improving weak ad-hoc retrieval by Web assistance and data fusion; Query expansion with the minimum relevance judgments; Improved concurrency control technique with lock-free querying for multi-dimensional index structure; A color-based image retrieval method using color distribution and common bitmap; A probabilistic model for music recommendation considering audio features...
Retrieval from Document Image Collections
This paper presents a system for retrieval of relevant documents from large document image collections. We achieve effective search and retrieval from a large collection of printed document images by matching image features at the word level. To represent the words, profile-based and shape-based features are employed. A novel DTW-based partial matching scheme is employed to take care of mo...